Simpson's Bias in NLP Training
In most machine learning tasks, we evaluate a model $M$ on a given data
population $S$ by measuring a population-level metric $F(S;M)$. Examples of
such evaluation metrics include precision/recall for (binary) recognition,
the F1 score for multi-class classification, and the BLEU metric for language
generation. On the other hand, the model is trained by optimizing a
sample-level loss $G(S_t;M)$ at each learning step $t$, where $S_t$ is a subset
of $S$ (a.k.a. the mini-batch). Popular choices of $G$ include cross-entropy
loss, the Dice loss, and sentence-level BLEU scores. A fundamental assumption
behind this paradigm is that the mean value of the sample-level loss $G$, if
averaged over all possible samples, should effectively represent the
population-level metric $F$ of the task, that is, $\mathbb{E}[G(S_t;M)] \approx F(S;M)$.
In this paper, we systematically investigate the above assumption in several
NLP tasks. We show, both theoretically and experimentally, that some popular
designs of the sample-level loss $G$ may be inconsistent with the true
population-level metric $F$ of the task, so that models trained to optimize the
former can be substantially sub-optimal with respect to the latter, a phenomenon
we call Simpson's bias due to its deep connections with the classic paradox
known as Simpson's reversal paradox in statistics and social sciences.
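To make the mismatch concrete, here is a minimal numeric sketch (ours, not from the paper) showing that the mean of a batch-level F1 can disagree sharply with the population-level F1 computed from the pooled counts; the data are invented for illustration:

```python
# Minimal sketch: averaging per-batch F1 vs. computing F1 on pooled counts.
# The (tp, fp, fn) counts below are invented purely for illustration.

def f1(tp, fp, fn):
    """F1 score from raw true-positive / false-positive / false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

batches = [(1, 0, 0),      # tiny batch: perfect predictions, per-batch F1 = 1.0
           (10, 40, 40)]   # large batch: noisy predictions, per-batch F1 = 0.2

# Population-level metric F: pool all counts, then compute F1 once.
pooled = tuple(sum(c) for c in zip(*batches))
print("population-level F1:", round(f1(*pooled), 3))  # ~0.216

# Mean of the sample-level loss G: compute F1 per batch, then average.
print("mean batch-level F1:", sum(f1(*b) for b in batches) / len(batches))  # 0.6
```

The gap arises because F1 is non-linear in the underlying counts, which is exactly the kind of inconsistency between sample-level and population-level quantities that the paper analyzes.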
GameEval: Evaluating LLMs on Conversational Games
The rapid advancements in large language models (LLMs) have presented
challenges in evaluating those models. Existing evaluation methods are either
reference-based or preference-based, which inevitably need human intervention
or introduce test bias caused by evaluator models. In this paper, we propose
GameEval, a novel approach to evaluating LLMs through goal-driven
conversational games, overcoming the limitations of previous methods. GameEval
treats LLMs as game players and assigns them distinct roles with specific goals
achieved by launching conversations of various forms, including discussion,
question answering, and voting. We design three unique games with cooperative
or adversarial objectives, accompanied by corresponding evaluation metrics, to
show how this new paradigm comprehensively evaluates model performance. Through
extensive experiments, we show that GameEval can effectively differentiate the
capabilities of various LLMs, providing a comprehensive assessment of their
integrated abilities to solve complex problems. Our public anonymous code is
available at https://github.com/GameEval/GameEval
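As a rough illustration of the goal-driven setup, the sketch below runs a multi-round conversation among LLM players, each prompted with its own role and goal. The `chat` callables, prompt wording, and post-hoc scoring step are our assumptions, not GameEval's actual interface:

```python
# Hedged sketch of a goal-driven conversational game loop; the player interface
# and prompts are hypothetical stand-ins, not GameEval's actual implementation.
from typing import Callable

def play_game(players: dict[str, Callable[[str], str]],
              goals: dict[str, str],
              rounds: int = 3) -> str:
    """Run a multi-round conversation; each player is an LLM wrapped as a callable."""
    transcript = ""
    for _ in range(rounds):
        for name, chat in players.items():
            prompt = (f"You are {name}. Your goal: {goals[name]}\n"
                      f"Conversation so far:\n{transcript}\n{name}:")
            transcript += f"{name}: {chat(prompt)}\n"
    # The transcript is scored afterwards against each player's
    # cooperative or adversarial goal.
    return transcript
```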
Learning to Program with Natural Language
Large Language Models (LLMs) have shown remarkable performance in various
basic natural language tasks, which raises hope for achieving Artificial
General Intelligence. To complete a complex task, however, we still need a
program for the task first and then ask LLMs to follow that program to generate
a specific solution. We propose using natural language as a new programming
language to describe task procedures, making them easily understandable to both
humans and LLMs. The LLM is capable of directly generating natural language
programs, but these programs may still contain factual errors or incomplete
steps. Therefore, we further propose the Learning to Program (LP) method
to ask LLMs themselves to learn the natural language program based on the
training dataset of the complex task first and then use the learned program to
guide the inference. Our experiments on the reasoning tasks of five different
reasoning types (8 datasets) demonstrate the effectiveness of our approach.
Further, our analysis experiment shows that the learned program can be directly
used to guide another LLM to improve its performance, which reveals a new
transfer learning paradigm.
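A minimal sketch of the two-phase idea, assuming a generic text-completion callable `llm` (a hypothetical stand-in, not the paper's code): first distill a natural-language program from solved training examples, then prepend it to guide inference:

```python
# Hedged sketch of the Learning to Program (LP) loop; prompts are illustrative.

def learn_program(llm, training_examples: list[str]) -> str:
    """Phase 1: ask the LLM to write a natural-language program from solved examples."""
    prompt = ("Study the solved examples below and write a step-by-step "
              "natural-language program for solving tasks of this type.\n\n"
              + "\n\n".join(training_examples))
    return llm(prompt)

def solve_with_program(llm, program: str, task: str) -> str:
    """Phase 2: guide inference on a new instance with the learned program."""
    return llm(f"Follow this program step by step:\n{program}\n\nTask: {task}\nSolution:")
```

Because the learned program is plain text, handing it to a different `llm` in the second phase is what enables the transfer effect described above.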
Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
We present Unicoder, a universal language encoder that is insensitive to
different languages. Given an arbitrary NLP task, a model can be trained with
Unicoder using training data in one language and directly applied to inputs of
the same task in other languages. Compared to similar efforts such as
Multilingual BERT and XLM, three new cross-lingual pre-training tasks are
proposed, including cross-lingual word recovery, cross-lingual paraphrase
classification and cross-lingual masked language model. These tasks help
Unicoder learn the mappings among different languages from more perspectives.
We also find that doing fine-tuning on multiple languages together can bring
further improvement. Experiments are performed on two tasks: cross-lingual
natural language inference (XNLI) and cross-lingual question answering (XQA),
where XLM is our baseline. On XNLI, we obtain a 1.8% average accuracy
improvement (across 15 languages). On XQA, a new cross-lingual dataset built by
us, we obtain a 5.5% average accuracy improvement (on French and German).
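For intuition, here is a simplified sketch of one of the three tasks, the cross-lingual masked language model: tokens are masked in a concatenated bilingual pair, so the model can recover them using context from either language. The tokenization, separator token, and masking rate are illustrative assumptions, not Unicoder's exact recipe:

```python
# Simplified sketch of building a cross-lingual masked-LM training example.
import random

def xmlm_example(src_tokens, tgt_tokens, mask_rate=0.15, mask="[MASK]"):
    pair = src_tokens + ["[SEP]"] + tgt_tokens   # bilingual context in one input
    labels = [None] * len(pair)                  # None = position not predicted
    for i, tok in enumerate(pair):
        if tok != "[SEP]" and random.random() < mask_rate:
            labels[i] = tok    # model must recover the original token,
            pair[i] = mask     # possibly attending to the other language
    return pair, labels

tokens, labels = xmlm_example("the cat sleeps".split(), "le chat dort".split())
```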
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Evaluating the general abilities of foundation models to tackle human-level
tasks is a vital aspect of their development and application in the pursuit of
Artificial General Intelligence (AGI). Traditional benchmarks, which rely on
artificial datasets, may not accurately represent human-level capabilities. In
this paper, we introduce AGIEval, a novel benchmark specifically designed to
assess foundation models in the context of human-centric standardized exams,
such as college entrance exams, law school admission tests, math competitions,
and lawyer qualification tests. We evaluate several state-of-the-art foundation
models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark.
Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math
competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5%
accuracy on the English test of the Chinese national college entrance exam.
This demonstrates the extraordinary performance of contemporary foundation
models. In contrast, we also find that GPT-4 is less proficient in tasks that
require complex reasoning or specific domain knowledge. Our comprehensive
analyses of model capabilities (understanding, knowledge, reasoning, and
calculation) reveal these models' strengths and limitations, providing valuable
insights into future directions for enhancing their general capabilities. By
concentrating on tasks pertinent to human cognition and decision-making, our
benchmark delivers a more meaningful and robust evaluation of foundation
models' performance in real-world scenarios. The data, code, and all model
outputs are released at https://github.com/microsoft/AGIEval.
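For readers who want to script this kind of measurement themselves, below is a hedged sketch of an exam-style accuracy harness for multiple-choice questions; the JSONL field names are our assumptions, not AGIEval's actual schema (the repository above defines the real format):

```python
# Hedged sketch of a multiple-choice accuracy harness; the field names
# ("question", "options", "answer") are assumed, not AGIEval's schema.
import json

def evaluate(model, path: str) -> float:
    """Score a model on exam questions stored one JSON object per line."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            prompt = ex["question"] + "\n" + "\n".join(ex["options"])
            pred = model(prompt).strip()[:1]   # assume the answer letter comes first
            correct += pred == ex["answer"]
            total += 1
    return correct / total
```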
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
Artificial Intelligence (AI) has made incredible progress recently. On the
one hand, advanced foundation models like ChatGPT can offer powerful
conversation, in-context learning and code generation abilities on a broad
range of open-domain tasks. They can also generate high-level solution outlines
for domain-specific tasks based on the common sense knowledge they have
acquired. However, they still face difficulties with some specialized tasks
because they lack sufficient domain-specific data during pre-training, or
because they often make errors in their neural-network computations on tasks
that require accurate execution. On the other hand, there are also many existing models and
systems (symbolic-based or neural-based) that can do some domain-specific tasks
very well. However, due to their different implementations or working mechanisms,
they are not easily accessible to or compatible with foundation models. Therefore,
there is a clear and pressing need for a mechanism that can leverage foundation
models to propose task solution outlines and then automatically match some of
the sub-tasks in the outlines to the off-the-shelf models and systems with
special functionalities to complete them. Inspired by this, we introduce
TaskMatrix.AI as a new AI ecosystem that connects foundation models with
millions of APIs for task completion. Unlike most previous work that aimed to
improve a single AI model, TaskMatrix.AI focuses more on using existing
foundation models (as a brain-like central system) and APIs of other AI models
and systems (as sub-task solvers) to achieve diversified tasks in both digital
and physical domains. As a position paper, we will present our vision of how to
build such an ecosystem, explain each key component, and use study cases to
illustrate both the feasibility of this vision and the main challenges we need
to address next.
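To ground the vision, here is a small sketch of the outline-then-match loop: the foundation model drafts sub-tasks, each sub-task is matched to a registered API, and the specialized system executes it. The registry, prompts, and fallback behavior are hypothetical illustrations, not TaskMatrix.AI's interface:

```python
# Hypothetical sketch of the outline-then-match flow; not TaskMatrix.AI's API.

def complete_task(llm, api_registry: dict, task: str) -> list:
    # 1. The foundation model (the "brain") drafts a high-level solution outline.
    outline = llm(f"Break this task into sub-tasks, one per line:\n{task}")
    results = []
    for subtask in outline.splitlines():
        # 2. Match the sub-task to an off-the-shelf API by name.
        choice = llm(f"Pick the best API for '{subtask}' from: "
                     + ", ".join(api_registry)).strip()
        handler = api_registry.get(choice)
        # 3. Execute with the specialized model/system; fall back to the LLM.
        results.append(handler(subtask) if handler else llm(subtask))
    return results
```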